Distributed Systems: The Complete Guide
1. What is a Distributed System?
A distributed system is a collection of independent computers (nodes/servers) that appear to the user as a single system. These nodes communicate via a network and work together to achieve a common goal.
Examples
- Google Search → billions of queries handled by thousands of servers globally
- Netflix → streaming movies from servers close to you (CDNs)
2. Why Distributed Systems?
- Scalability → handle millions of users by adding servers
- Fault Tolerance → if one server fails, the system keeps running
- Low Latency → bring services closer to users (e.g., CDNs)
- Cost Efficiency → commodity hardware instead of giant supercomputers
3. Key Characteristics
- Transparency: Users don't know if data is on one machine or many
- Concurrency: Many users/tasks run simultaneously
- Fault tolerance: Survives machine/network failures
- Scalability: Can grow horizontally by adding more machines
4. Challenges in Distributed Systems
- Network latency & partitioning (messages may be delayed, dropped, or duplicated; see the retry sketch after this list)
- Consistency across replicas (everyone should see the same data)
- Fault tolerance (what if a server crashes during a transaction?)
- Coordination between nodes
- Security (data traveling across networks)
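Because messages can be delayed, dropped, or duplicated, clients commonly combine timeouts, retries, and idempotent requests. Below is a minimal sketch; flaky_send and the request shape are hypothetical stand-ins for a real network call.

```python
import random
import time

def flaky_send(request):
    """Simulates a network call that is sometimes lost or too slow (hypothetical)."""
    if random.random() < 0.3:
        raise TimeoutError("request lost or delayed past the deadline")
    return {"status": "ok", "request_id": request["request_id"]}

def send_with_retries(request, max_attempts=3, backoff_s=0.1):
    """Retry on timeout; the request carries an id so a duplicate delivery is safe to re-apply."""
    for attempt in range(1, max_attempts + 1):
        try:
            return flaky_send(request)
        except TimeoutError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * attempt)  # simple linear backoff between attempts

try:
    print(send_with_retries({"request_id": "order-42", "op": "charge"}))
except TimeoutError:
    print("gave up after retries; safe to retry later because the operation is idempotent")
```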
5. Core Concepts
CAP Theorem
Any distributed system can guarantee at most two of the following three properties at the same time. Since network partitions cannot be avoided in practice, the real trade-off is between consistency and availability when a partition occurs:
- Consistency → every read sees the most recent write
- Availability → every request gets a (non-error) response
- Partition tolerance → the system keeps working even if the network splits
Examples:
- CP (Consistency + Partition Tolerance) → MongoDB, HBase
- AP (Availability + Partition Tolerance) → DynamoDB, Cassandra
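To make the trade-off concrete, the toy sketch below contrasts how a CP-style replica and an AP-style replica might answer a read during a partition. The classes are illustrative assumptions, not the behavior of any specific database listed above.

```python
class Replica:
    def __init__(self, value):
        self.value = value          # last value this replica has seen
        self.partitioned = False    # True when cut off from the rest of the cluster

class CPReplica(Replica):
    def read(self):
        # CP choice: refuse to answer rather than risk returning stale data
        if self.partitioned:
            raise RuntimeError("unavailable: cannot confirm latest value during partition")
        return self.value

class APReplica(Replica):
    def read(self):
        # AP choice: always answer, even if the value may be stale
        return self.value

cp, ap = CPReplica("v1"), APReplica("v1")
cp.partitioned = ap.partitioned = True   # simulate a network split
print(ap.read())                         # AP replica returns a possibly stale "v1"
try:
    cp.read()
except RuntimeError as e:
    print(e)                             # CP replica chooses consistency over availability
```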
Data Replication
- Master-Slave (Primary-Replica): a single primary accepts all writes; replicas serve reads
- Multi-Master: multiple nodes accept writes (requires conflict resolution, which adds complexity)
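A minimal sketch of the primary-replica pattern, assuming synchronous replication for simplicity (the class names are illustrative): writes go to a single primary, which pushes them to replicas that serve reads.

```python
class Node:
    def __init__(self):
        self.data = {}

class Replica(Node):
    def apply(self, key, value):
        self.data[key] = value       # receive replicated writes from the primary

    def read(self, key):
        return self.data.get(key)    # reads can be served without touching the primary

class Primary(Node):
    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas

    def write(self, key, value):
        # Apply locally, then replicate to every follower (synchronously, for simplicity)
        self.data[key] = value
        for r in self.replicas:
            r.apply(key, value)

replicas = [Replica(), Replica()]
primary = Primary(replicas)
primary.write("user:1", "alice")
print(replicas[0].read("user:1"))    # -> "alice", served by a replica
```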
Consensus
How do nodes agree on a value despite failures?
- Paxos
- Raft
- ZAB (used in ZooKeeper)
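Real protocols like Paxos and Raft involve leader election, terms, and replicated logs, but their core safety idea is that a value is only chosen once a majority (a quorum) of nodes accepts it. The sketch below illustrates just that quorum rule; it is a toy example, not an implementation of any of the protocols above.

```python
def propose(value, votes):
    """A value is chosen only if a strict majority of nodes accept it."""
    accepted = sum(1 for v in votes if v)
    quorum = len(votes) // 2 + 1
    return value if accepted >= quorum else None

# 5-node cluster: the proposal is chosen even though two nodes reject or are down
votes = [True, True, True, False, False]
print(propose("x=42", votes))   # -> "x=42" (3 of 5 accepted)

# Without a majority, no value is chosen and the proposal must be retried
votes = [True, True, False, False, False]
print(propose("x=42", votes))   # -> None
```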